IBM System x GPFS Storage Server - stfc
IBM System x GPFS Storage Server
Crispin Keable
Technical Computing Architect
© 2012 <strong>IBM</strong> Corporation
IBM Technical Computing: a comprehensive portfolio that uniquely addresses supercomputing and mainstream client needs

• Power Systems™ – engine for faster insights
• PureSystems™ – integrated expertise for improved economics
• System x® – redefining x86
• Blue Gene® – extremely fast, energy-efficient supercomputer
• System Storage® – smarter storage
• iDataPlex® – fast, dense, flexible
• GPFS and GPFS Storage Server – big data storage
• IBM Platform LSF® Family
• IBM Platform Symphony Family
• IBM Platform HPC
• IBM Platform Cluster Manager
• HPC Cloud solutions
• Technical Computing for Big Data
• Intelligent Cluster – factory-integrated, interoperability-tested system with compute, storage, networking and cluster management
“Perfect Storm” of Synergetic Innovations
GPFS Native RAID Storage Server: big data converging with HPC technology – server and storage convergence

• Disruptive integrated storage software: declustered RAID with GPFS reduces overhead and speeds rebuilds by ~4-6x
• Performance: POWER and x86 cores are more powerful than special-use controller chips
• High-speed interconnect: clustering and storage traffic, including failover (PERCS/Power fabric, InfiniBand, or 10GE)
• Data integrity, reliability and flexibility: end-to-end checksum, 2- and 3-fault tolerance, application-optimized RAID
• Integrated hardware/packaging: server and storage co-packaging improves density and efficiency
• Cost/performance: a software-based controller reduces hardware overhead and cost, and enables enhanced functionality
High End: POWER
IBM GPFS Native RAID p775: high-density storage + compute server
• Based on the Power 775 / PERCS solution
• Basic configuration:
  – 32 POWER7 32-core high-bandwidth servers
  – Configurable as GPFS Native RAID storage controllers, compute servers, I/O servers or spares
• Up to 5 disk enclosures per rack
  – 384 drives and 64 quad-lane SAS ports each
• Capacity: 1.1 PB/rack (900 GB SAS HDDs)
• Bandwidth: >150 GB/s read bandwidth per rack
• Compute power: 18 TF + node sparing
• Interconnect: IBM high-bandwidth optical PERCS
• Multi-rack scalable, fully water-cooled
1 rack performs a 1 TB Hadoop TeraSort in less than 3 minutes!
How does GNR work?
Traditional setup: clients connect to file/data servers (e.g., x3650 NSD File Servers 1 and 2), which sit in front of custom dedicated disk controllers and JBOD disk enclosures.
GNR setup: clients connect over FDR InfiniBand or 10 GbE to NSD file servers running GPFS Native RAID, attached directly to JBOD disk enclosures.
The idea: migrate RAID and disk management to commodity file servers!
A Scalable Building Block Approach to Storage
Building block: x3650 M4 servers with a “twin-tailed” JBOD disk enclosure.
Complete storage solution: data servers, disk (NL-SAS and SSD), software, InfiniBand and Ethernet.
• Model 24 – light and fast: 4 enclosures, 20U, 232 NL-SAS + 6 SSD, 10 GB/s
• Model 26 – HPC workhorse: 6 enclosures, 28U, 348 NL-SAS + 6 SSD, 12 GB/s
• High-density HPC option: 18 enclosures in 2 standard 42U racks, 1044 NL-SAS + 18 SSD, 36 GB/s
Performance figures are based on the IOR benchmark.
GPFS Native RAID Feature Detail
• Declustered RAID
  – Data and parity stripes are uniformly partitioned and distributed across a disk array.
  – Arbitrary number of disks per array (not constrained to an integral number of RAID stripe widths)
• 2-fault and 3-fault tolerance
  – Reed-Solomon parity encoding
  – 2- or 3-fault-tolerant stripes: 8 data strips + 2 or 3 parity strips
  – 3- or 4-way mirroring
• End-to-end checksum and dropped-write detection
  – From the disk surface to the GPFS user/client
  – Detects and corrects off-track and lost/dropped disk writes
• Asynchronous error diagnosis while affected I/Os continue
  – If media error: verify and restore if possible
  – If path problem: attempt alternate paths
• Supports live replacement of disks
  – I/O operations continue for tracks whose disks have been removed during carrier service
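A quick sketch of the arithmetic behind these stripe layouts (an illustration of the trade-off, not GNR code): a stripe of 8 data strips plus p parity strips tolerates p concurrent disk faults and stores 8/(8+p) of the raw capacity as user data.

```python
def stripe_efficiency(data_strips: int, parity_strips: int) -> float:
    """Fraction of raw capacity available for user data."""
    return data_strips / (data_strips + parity_strips)

for p in (2, 3):
    eff = stripe_efficiency(8, p)
    print(f"8+{p} Reed-Solomon: tolerates {p} faults, "
          f"{eff:.0%} storage efficiency")

# 3-way mirroring (also 2-fault-tolerant) for comparison:
print(f"3-way mirror: {1/3:.0%} efficiency")
```

The 80% (8+2) and 73% (8+3) figures match the efficiency numbers quoted later in this deck.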
Declustering – bringing parallel performance to disk maintenance
• Conventional RAID: narrow data+parity arrays
  – 20 disks arranged as 4 conventional RAID arrays of 5 disks each, holding 4x4 RAID stripes (data plus parity)
  – After a disk fails, rebuild can only use the I/O capacity of the 4 surviving disks in that array
  – Because files are striped across all arrays, all file accesses are throttled by the failed array’s rebuild overhead
• Declustered RAID: data+parity distributed over all disks
  – The same 20 disks form 1 declustered RAID array holding 16 RAID stripes (data plus parity)
  – After a disk fails, rebuild can use the I/O capacity of all 19 surviving disks
  – Load on file accesses is reduced by 4.8x (=19/4) during array rebuild
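The 4.8x figure falls straight out of the disk counts, as this back-of-envelope sketch shows: a conventional 5-disk array rebuilds from its 4 survivors, while a 20-disk declustered array rebuilds from all 19 survivors.

```python
def rebuild_speedup(total_disks: int, array_width: int) -> float:
    """Ratio of disks available to a declustered rebuild vs. a
    conventional rebuild confined to one array_width-disk array."""
    return (total_disks - 1) / (array_width - 1)

print(rebuild_speedup(20, 5))  # 19/4 = 4.75, the slide's ~4.8x
```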
Declustered RAID Example
• Conventional layout: 3 one-fault-tolerant mirrored groups (RAID1), 7 stripes per group (2 strips per stripe) – 3 groups on 6 disks, plus 1 spare disk.
• Declustered layout: the same 21 stripes (42 strips) plus 7 spare strips – 49 strips in total – are spread uniformly across all 7 disks.
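The strip accounting in this example can be sanity-checked in a few lines (just arithmetic on the numbers above):

```python
# 3 mirrored groups, 7 stripes each, 2 strips per stripe, on 7 disks.
groups, stripes_per_group, strips_per_stripe = 3, 7, 2
disks = 7

stripes = groups * stripes_per_group        # 21 stripes
data_strips = stripes * strips_per_stripe   # 42 strips
spare_strips = stripes_per_group            # 7 spare strips (one disk's worth)
total_strips = data_strips + spare_strips   # 49 strips

# Declustered, the 49 strips fill the 7 disks evenly, 7 strips per disk:
assert total_strips == disks * 7
print(stripes, data_strips, spare_strips, total_strips)  # 21 42 7 49
```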
Rebuild Overhead Reduction Example
• Conventional RAID: after a disk fails, rebuild read/write activity is confined to just a few disks – a slow rebuild that disrupts user programs.
• Declustered RAID: rebuild read/write activity is spread across many disks – less disruption to user programs.
Rebuild overhead is reduced by 3.5x in this example.
GPFS Native RAID Advantages
• Lower cost!
  – Software RAID – no hardware storage controller
    • 10-30% lower cost with higher performance
  – Off-the-shelf SBODs
    • Generic low-cost disk enclosures
    • Standardized in-band SES management
  – Standard Linux or AIX
  – Generic high-volume servers
  – Component of GPFS
• Extreme data integrity
  – 2- and 3-fault-tolerant erasure codes
    • 80% and 73% storage efficiency, respectively
  – End-to-end checksum
  – Protection against lost writes
• Industry-leading performance
  – Fastest rebuild times, using declustered RAID
  – Declustered RAID – reduced application load during rebuilds
    • Up to 3x lower overhead to applications
  – Aligned full-stripe writes – disk limited
  – Small writes – limited by backup-node NVRAM log writes
  – Faster than alternatives today – and tomorrow!
Introducing IBM System x GPFS Storage Server:
Bringing HPC Technology to the Mainstream
• Better, sustained performance
  – Industry-leading throughput using efficient declustered RAID techniques
• Better value
  – Leverages System x servers and commercial JBODs
• Better data security
  – From the disk platter to the client
  – Enhanced RAID protection technology
• Affordably scalable
  – Start small and affordably
  – Scale via incremental additions
  – Add capacity AND bandwidth
• 3-year warranty
  – Manage and budget costs
• IT-facility friendly
  – Industry-standard 42U 19-inch rack mounts
  – No special height requirements
  – Client racks are OK!
• And all the data management/life-cycle capabilities of GPFS – built in!
Declustered RAID6 Example
Left: 14 physical disks / 3 traditional RAID6 arrays / 2 spares. Right: the same 14 physical disks / 1 declustered RAID6 array / 2 spares – data, parity and spare space are all declustered.
With two failed disks:
• Traditional RAID6: both failures land in one array (Green), so every one of its stripes carries 2 faults.
  Faults per stripe (Red/Green/Blue), 7 stripes: 0/2/0 for every stripe.
  Number of stripes with 2 faults = 7
• Declustered RAID6: the two failures are spread across the whole array.
  Faults per stripe (Red/Green/Blue): 1/0/1, 0/0/1, 0/1/1, 2/0/0, 0/1/1, 1/0/1, 0/1/0.
  Number of stripes with 2 faults = 1
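A toy version of this comparison (illustrative only; the strip placements below are hypothetical, not GNR's actual layout): with two failed disks out of 14, count how many stripes are left carrying 2 faults under each layout.

```python
import random

random.seed(1)
DISKS, STRIPE_WIDTH, STRIPES = 14, 4, 21  # e.g., 2+2 RAID6-style stripes

# Traditional: stripes confined to fixed 4-disk arrays (disks 12, 13 spare).
traditional = [tuple(range(a, a + STRIPE_WIDTH))
               for a in (0, 4, 8) for _ in range(7)]

# Declustered: each stripe placed on a random set of 4 of the 14 disks.
declustered = [tuple(random.sample(range(DISKS), STRIPE_WIDTH))
               for _ in range(STRIPES)]

def stripes_with_two_faults(layout, failed):
    return sum(1 for stripe in layout
               if sum(d in failed for d in stripe) >= 2)

failed = {4, 5}  # two failures inside the second traditional array
print("traditional:", stripes_with_two_faults(traditional, failed))  # 7
print("declustered:", stripes_with_two_faults(declustered, failed))
```

In the traditional layout every stripe of the hit array has 2 faults; in the declustered layout only the occasional stripe happens to touch both failed disks.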
Where GPFS Storage Server Fits
(Figure: positioning chart, entry level to high end.)
• Storage offerings: Direct Attached (DS3000 + V3700), DCS3700, DCS3700+, SONAS, and GPFS Storage Server at the high end.
• Client segments: local universities, petroleum, media/entertainment, financial services, bio/life science, CAE, higher-end universities, government research.
Data Protection Designed for 200K+ Drives!
• Platter-to-client protection
  – Multi-level data protection to detect and prevent bad writes and on-disk data loss
  – Data checksum carried and sent from the platter to the client server
• Integrity management
  – Rebuild
    • Selectively rebuild portions of a disk
    • Restore full redundancy, in priority order, after disk failures
  – Rebalance
    • When a failed disk is replaced with a spare disk, redistribute the free space
  – Scrub
    • Verify checksums of data and parity/mirror
    • Verify consistency of data and parity/mirror
    • Fix problems found on disk
  – Opportunistic scheduling
    • At full disk speed when there is no user activity
    • At a configurable rate when the system is busy
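The scrub step can be sketched for the simplest case, a 2-way mirror (hypothetical structures; GNR's real scrubber also handles parity codes and opportunistic scheduling): verify each copy's checksum and repair a corrupt copy from its healthy mirror.

```python
import zlib

def strip(data: bytes):
    """An on-disk strip: payload plus its checksum."""
    return {"data": data, "crc": zlib.crc32(data)}

def scrub_mirrored(stripes):
    """Return the number of strips repaired from their mirror."""
    repairs = 0
    for a, b in stripes:  # each stripe is a pair of mirrored strips
        for bad, good in ((a, b), (b, a)):
            good_ok = zlib.crc32(good["data"]) == good["crc"]
            if zlib.crc32(bad["data"]) != bad["crc"] and good_ok:
                bad["data"], bad["crc"] = good["data"], good["crc"]
                repairs += 1
    return repairs

# One stripe whose second copy has silently rotted on disk:
mirror = (strip(b"payload"), strip(b"payload"))
mirror[1]["data"] = b"pAyload"          # silent corruption
print(scrub_mirrored([mirror]))         # 1 repair
assert mirror[1]["data"] == b"payload"  # restored from the healthy copy
```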
Non-Intrusive Disk Diagnostics
• Disk hospital: background determination of problems
  – While a disk is in the hospital, GNR non-intrusively and immediately returns data to the client using the error correction code.
  – For writes, GNR non-intrusively marks the write data and reconstructs it later in the background, after problem determination is complete.
• Advanced fault determination
  – Statistical reliability and SMART monitoring
  – Neighbor check
  – Media error detection and correction
GSS – End-to-End Checksums and Version Numbers
• End-to-end checksums: each data block carries a checksum trailer
  – Write operation
    • Checked between the user compute node and the GNR node
    • Written from the GNR node to disk together with a version number
  – Read operation
    • Checked from disk to the GNR node, along with the version number
    • Checked from the I/O node to the user compute node
• Version numbers in metadata are used to validate checksum trailers for dropped-write detection
  – Only a validated checksum can protect against dropped writes
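A toy illustration of why the version number matters (hypothetical on-disk format, not GNR's): a dropped write leaves a stale block whose checksum trailer is still internally valid, so the checksum alone cannot catch it. Comparing the trailer's version against the version recorded in metadata does.

```python
import zlib

def make_block(data: bytes, version: int):
    """A block with a checksum trailer and a write version number."""
    return {"data": data, "crc": zlib.crc32(data), "version": version}

def read_check(block, expected_version: int) -> str:
    if zlib.crc32(block["data"]) != block["crc"]:
        return "corrupt"            # bit rot, off-track write
    if block["version"] != expected_version:
        return "dropped write"      # stale but self-consistent block
    return "ok"

disk = make_block(b"old contents", version=1)
# A rewrite to version 2 is dropped by the disk; metadata says version 2:
print(read_check(disk, expected_version=2))  # dropped write
```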
GSS Data Integrity
• Silent data corruption
  – Caused by disk off-track writes, dropped writes (e.g., disk firmware bugs), or undetected read errors
• Old adage: “No data is better than bad data”
• Proper data integrity checking requires an end-to-end checksum plus dropped-write detection.
(Figure: a client reads block A and the disk returns A; later it reads A again and the disk returns B – silently.)
GNR / Mestor Future Research Directions
• GNR “ring” configuration
  – Adaptation of the building-block approach
  – Shared (dual-ported) disks; data managed by storage nodes
  – Overlapping configuration: half the nodes of standard controllers
    • Storage node pairs attach to shared disks
    • Scale out to many storage nodes over the fabrics
    • Global namespace / disk management
• Mestor
  – Non-shared-disk approach / network RAID
    • Data striped across storage nodes
    • Each storage node has captive disks
    • Scale out to many storage nodes
    • Global namespace / disk management
IBM Confidential
GPFS Native RAID for System x Proposed Timeline
2012 – first customer ship at ISC12; v1.0 announce at SC12.
V1.0: Getting started
• System x Intelligent Clusters “solution”, ordered through the System x Intelligent Cluster process
• Software installed and configured at the customer location by the end user or IBM Services
• Support coordinated by Intelligent Clusters
• Early-access customers
2013 – V1.5 announce at ISC13.
V1.5
• Solution sold via Intelligent Clusters
• Bug fixes
• Support provided via the Intelligent Clusters standard mechanism
• Upgrade path defined for current DCS3700 customers
• Drive roll for new drives (4 TB NL-SAS)
2014 – V2.0: complete machine type/model, fully supported
• Plug-and-play GPFS appliance
• GUI for management
• Evaluate smaller form factor – 12-drive enclosures?
2015 – V2.5: miniaturization release
• Support entry level based upon Mestor
• Storage-rich servers (internal drives)
• RAID across the servers